Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics; subject, verb, and object prediction accuracy; and a human evaluation.
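To make the abstract's "convolutional plus recurrent" pipeline concrete, below is a minimal sketch of one common instantiation: per-frame CNN features are mean-pooled into a single video vector, which conditions an LSTM that decodes the sentence word by word. This is an illustrative sketch, not the authors' released code; the class name `MeanPoolLSTMCaptioner`, all dimensions (`feat_dim`, `embed_dim`, `hidden_dim`, `vocab_size`), and the choice of prepending the video vector as the decoder's first input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MeanPoolLSTMCaptioner(nn.Module):
    """Illustrative sketch: mean-pooled per-frame CNN features
    condition an LSTM that generates a sentence."""

    def __init__(self, feat_dim=4096, embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project the pooled visual feature into the word-embedding space.
        self.visual_proj = nn.Linear(feat_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # per-step word scores

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) CNN activations per frame
        # captions:    (batch, seq_len) word indices of the target sentence
        video = self.visual_proj(frame_feats.mean(dim=1))  # pool over frames
        words = self.word_embed(captions)
        # Prepend the video vector so it conditions the whole decode.
        inputs = torch.cat([video.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # (batch, seq_len + 1, vocab_size)
```

Mean-pooling discards temporal order but keeps the decoder simple, which is one way a model pretrained on labeled and captioned still images can be reused for video: the per-frame feature extractor carries over unchanged.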